.libPaths(new = "libR/")
library(VennDiagram)
library(venneuler)
library(RColorBrewer)
library(clusterProfiler)
library(GO.db)
library(enrichplot)
library(tidyverse)

Busco data

Load busco results:

acutogordius_hic <- read_tsv("../results/busco545-Nematomorpha/out_metazoa_odb10/Acutogordius_australiensis-152393-HiC.fa/run_metazoa_odb10/full_table.tsv", skip = 2)
nectonema_draft <- read_tsv("../results/busco545-Nematomorpha/out_metazoa_odb10/Nectonema_munidae-DNA05827-Draft.fa/run_metazoa_odb10/full_table.tsv", skip = 2)

Join all in one data frame:

buscos <- bind_rows("Acutogordius_HiC" = acutogordius_hic,
                    "Nectonema_Draft" = nectonema_draft,
                    .id = "assembly") %>% rename("BuscoID" = "# Busco id")
buscos

Set of all 954 Metazoa buscos:

metazoa_genes <- buscos %>% pull(BuscoID) %>% unique()
#write_csv2(as.data.frame(metazoa_genes),
#           file = "../data/metazoa_buscos.csv", col_names = F)

Exploratory plots

Length histogram:

ggplot(buscos) +
  geom_histogram(aes(x = Length, fill = Status)) +
  scale_fill_brewer(type = "qual") +
  theme_minimal() +
  facet_wrap(~assembly, ncol=1)

Length boxplot:

ggplot(buscos) +
  geom_boxplot(aes(x = Status, y = Length, fill = assembly)) +
  scale_fill_brewer(type = "qual") +
  theme_minimal()

Venn diagram

List of genes present for each sample, and the total metazoan set:

present <- buscos %>%
  filter(Status != "Missing") %>%
  with(., split(BuscoID, assembly))

present$Metazoa_buscos <- metazoa_genes

summary(present)
                 Length Class  Mode     
Acutogordius_HiC 699    -none- character
Nectonema_Draft  663    -none- character
Metazoa_buscos   954    -none- character

Only option that automatically puts smaller sets inside larger set, but circles not proportional:

myCol <- brewer.pal(3, "Pastel2")
vennbuscos <- venn.diagram(x = present, disable.logging = T,
                           filename = '../figures/venn_buscos/venn_buscos.png',
             category.names = c("Acutogordius" , "Nectonema" , "Metazoa BUSCOS"),
             imagetype="png", resolution = 300,
             height = 480, width = 480,
             # Circles
             lwd = 2, lty = 'blank', fill = myCol,
             # Numbers
             cex = .5, fontface = "bold", fontfamily = "sans",
             # Set names
             cat.cex = 0.4, cat.fontface = "bold", cat.default.pos = "text",
             cat.fontfamily = "sans")

Option with correct proportions, but not nested: This is the one used for the main Figure in the manuscript.

pdf(file = "../figures/venn_buscos/venn_prop.pdf", width = 5, height = 5)
  plot(venneuler(c(Acutogordius=78, Nectonema=88, MetazoaBuscos=954,
                 "Acutogordius&Nectonema"=568, "Acutogordius&MetazoaBuscos"=78,
                 "Nectonema&MetazoaBuscos"=88 ,"Acutogordius&Nectonema&MetazoaBuscos"=734)))
dev.off()

Chi-square test for overlap of 220 missing genes. First create a matrix with number of genes for each/both species, taken easily from the venn_buscos.png file created just above.

# cols = present in Acuto, absent in Acuto
# rows = present in Necto, absent in Necto
chimatrix <- matrix(c(568, 88, 78, 220), ncol = 2)
chimatrix
     [,1] [,2]
[1,]  568   78
[2,]   88  220
chisq.test(chimatrix)

    Pearson's Chi-squared test with Yates' continuity correction

data:  chimatrix
X-squared = 339.31, df = 1, p-value < 2.2e-16

Enrichment analysis

Hypergeometric enrichment (= simple enrichment or over-representation analysis) to assess whether the number of missing buscos associated with a given function is larger than expected by chance.

1- Define GO universe, the background set of all GO terms against which we will test for enrichment. It should include any gene that could have been positive (>GO terms of all 954 metazoan buscos and their GO ancestral terms) 2- Define genes significant for variable of interest (>missing buscos in genome of nematomorphs) 3- Calculate proportion of significant genes in gene set 4- Estimate probability and significance (p-value)

GO universe

Start with loading the data we got with a custom python script (buscos_to_GO.py) that retrieves GO ids/terms from the OrthoDB v10 based on the list of all 954 Metazoan Busco IDs. Then filter only for GO terms (there are interpro and other lines there too). The terms themselves were not complete (NA for unclassified GOs), and even those do have classification if you look them up in https://www.ebi.ac.uk/QuickGO/, so here we will only keep the original BuscoID and the GO ids retrieved with the python script. We will use the package GO.db in the following step to get all GO terms and their ontology classification from each GO id.

buscos_GOs <- read_csv("../data/metazoa_buscos_GOterms.csv") %>%
  filter(grepl("GO:", name)) %>%
  filter(!obsolete_term) %>%
  select(BuscoID = orthodb_id, GOID = id, ancestor_terms,
         count, obsolete_term)
New names:Rows: 8456 Columns: 10── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): description, type, id, ontology_type, orthodb_id, name, ancestor_terms
dbl (2): ...1, count
lgl (1): obsolete_term
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
buscos_GOs

The data above only has the specific GO ids/terms for the Busco genes, but not the associated higher level terms. Let’s expand the lists of ancestor_terms, and bind it with the original data set, so that each term (specific and ancestral) has its own line:

metazoanGO <- buscos_GOs %>%
  mutate(ancestor_terms = str_split(ancestor_terms, ",")) %>%
  select(BuscoID, GOID=ancestor_terms) %>%
  unnest(GOID) %>%
  bind_rows(select(buscos_GOs, BuscoID, GOID)) %>%
  rowwise() %>%
  mutate(GO_term = Term(GOID),
         GO_category = Ontology(GOID)) %>%
  ungroup %>%
  filter(!is.na(GO_category)) # GO ids that did not result in any terms
                      # quickgo shows they are secondary ids of other GOs already represented
metazoanGO

Prepare GO universe input files for enrichment analysis: From help of enricher(): TERM2GENE: user input annotation of TERM TO GENE mapping, a data.frame of 2 column with term and gene.

term2gene <- metazoanGO %>%
  split(f = as.factor(.$GO_category)) %>%
  purrr::map(~select(.x, GOID, BuscoID))

term2gene
$BP

$CC

$MF
NA

This one is optional, but it’s how we keep the GO term name: From help of enricher(): TERM2NAME: user input of TERM TO NAME mapping, a data.frame of 2 column with term and name.

term2name <- metazoanGO %>%
  split(f = as.factor(.$GO_category)) %>%
  purrr::map(~select(.x, GOID, GO_term))

term2name
$BP

$CC

$MF
NA

Genes of interest

Input file for set of genes of interest to test for enrichment (should be a set of unique ids): buscos that are missing from both Acutogordius and Nectonema. The code creates a T/F column for whether or not the busco is missing, then groups by BuscoID and sums TRUEs, so that we get a column where 0 = not missing in either, 1 = missing in one of the species, 2 = missing in both species:

overlap_missing <- buscos %>%
  mutate(is_missing = Status == "Missing") %>%
  group_by(BuscoID) %>%
  summarize(Nmissing = sum(is_missing)) %>%
  filter(Nmissing == 2) %>%
  pull(BuscoID)
#overlap_missing

Over-representation analysis

Now put all together in clusterprofiler analysis. For each of the three GO categories (Cellular Component, Molecular Function, Biological Process), run enricher function of clusterprofiler:

enrich_GO <- purrr::map2(.x = term2gene, .y = term2name,
                         ~enricher(gene = overlap_missing, TERM2GENE = .x, TERM2NAME = .y))
# assign ontology to variable (says unknown by default)
enrich_GO[[1]]@ontology <- names(enrich_GO)[1]
enrich_GO[[2]]@ontology <- names(enrich_GO)[2]
enrich_GO[[3]]@ontology <- names(enrich_GO)[3]
enrich_GO#[[1]]@result %>% filter(p.adjust < 0.05)
$BP
#
# over-representation test
#
#...@organism    UNKNOWN 
#...@ontology    BP 
#...@gene    chr [1:220] "107574at33208" "108764at33208" "111730at33208" "111746at33208" "114954at33208" ...
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...37 enriched terms found
'data.frame':   37 obs. of  9 variables:
 $ ID         : chr  "GO:0030030" "GO:0120036" "GO:0044782" "GO:0030031" ...
 $ Description: chr  "cell projection organization" "plasma membrane bounded cell projection organization" "cilium organization" "cell projection assembly" ...
 $ GeneRatio  : chr  "13/139" "13/139" "12/139" "11/139" ...
 $ BgRatio    : chr  "14/673" "14/673" "13/673" "12/673" ...
 $ pvalue     : num  9.03e-09 9.03e-09 4.36e-08 2.08e-07 2.08e-07 ...
 $ p.adjust   : num  1.16e-06 1.16e-06 3.72e-06 8.89e-06 8.89e-06 ...
 $ qvalue     : num  9.08e-07 9.08e-07 2.92e-06 6.98e-06 6.98e-06 ...
 $ geneID     : chr  "111730at33208/119591at33208/157138at33208/218441at33208/257339at33208/284059at33208/315044at33208/334820at33208"| __truncated__ "111730at33208/119591at33208/157138at33208/218441at33208/257339at33208/284059at33208/315044at33208/334820at33208"| __truncated__ "111730at33208/119591at33208/218441at33208/257339at33208/284059at33208/315044at33208/334820at33208/340577at33208"| __truncated__ "111730at33208/119591at33208/218441at33208/257339at33208/284059at33208/315044at33208/334820at33208/340577at33208"| __truncated__ ...
 $ Count      : int  13 13 12 11 11 11 44 15 45 42 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 


$CC
#
# over-representation test
#
#...@organism    UNKNOWN 
#...@ontology    CC 
#...@gene    chr [1:220] "107574at33208" "108764at33208" "111730at33208" "111746at33208" "114954at33208" ...
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...5 enriched terms found
'data.frame':   5 obs. of  9 variables:
 $ ID         : chr  "GO:0042995" "GO:0120025" "GO:0005929" "GO:0015630" ...
 $ Description: chr  "cell projection" "plasma membrane bounded cell projection" "cilium" "microtubule cytoskeleton" ...
 $ GeneRatio  : chr  "10/141" "10/141" "8/141" "11/141" ...
 $ BgRatio    : chr  "13/651" "13/651" "10/651" "17/651" ...
 $ pvalue     : num  2.71e-05 2.71e-05 1.24e-04 1.29e-04 9.47e-04
 $ p.adjust   : num  0.000922 0.000922 0.002196 0.002196 0.012874
 $ qvalue     : num  0.000857 0.000857 0.00204 0.00204 0.011958
 $ geneID     : chr  "137383at33208/157138at33208/315044at33208/334820at33208/340577at33208/378913at33208/518288at33208/534757at33208"| __truncated__ "137383at33208/157138at33208/315044at33208/334820at33208/340577at33208/378913at33208/518288at33208/534757at33208"| __truncated__ "137383at33208/315044at33208/334820at33208/340577at33208/378913at33208/518288at33208/534757at33208/564234at33208" "114954at33208/137383at33208/247183at33208/397659at33208/421732at33208/450643at33208/534757at33208/561248at33208"| __truncated__ ...
 $ Count      : int  10 10 8 11 11
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 


$MF
#
# over-representation test
#
#...@organism    UNKNOWN 
#...@ontology    MF 
#...@gene    chr [1:220] "107574at33208" "108764at33208" "111730at33208" "111746at33208" "114954at33208" ...
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...1 enriched terms found
'data.frame':   1 obs. of  9 variables:
 $ ID         : chr "GO:0005515"
 $ Description: chr "protein binding"
 $ GeneRatio  : chr "34/114"
 $ BgRatio    : chr "78/566"
 $ pvalue     : num 2.59e-07
 $ p.adjust   : num 2.05e-05
 $ qvalue     : num 2.05e-05
 $ geneID     : chr "14776at33208/152327at33208/247183at33208/259581at33208/272605at33208/281693at33208/285966at33208/302279at33208/"| __truncated__
 $ Count      : int 34
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

Get result df from enrich_GO and add GO definitions and ontology:

sign_GOs <- function(df){
  df %>%
    filter(p.adjust < 0.05) %>%
    mutate(GO_definition = Definition(ID),
           GO_category = Ontology(ID)) %>%
    separate(GeneRatio, into = c("size_missing_term","size_missing_all"),
             sep = '/', remove = F) %>%
    separate(BgRatio, into = c("size_background_term","size_background_all"),
             sep = '/', remove = F) %>%
    mutate_at(vars("size_missing_term","size_missing_all",
                   "size_background_term","size_background_all"), as.numeric) %>%
    mutate(`Missing over background` = size_missing_term/size_background_term) %>%
    select(`GO ID` = ID, `GO term description` = Description, 
           `GO category` = GO_category, `Associated missing BUSCOs` = Count,
           `Missing ratio` = GeneRatio, `Background ratio` = BgRatio,
           `Missing over background`, pvalue, p.adjust, 
           `BUSCO ID` = geneID, `GO term definition` = GO_definition) %>%
    remove_rownames()}

enriched <- enrich_GO %>%
  purrr::map(~sign_GOs(.x@result))
enriched
$BP

$CC

$MF
NA

GeneRatio = k/n, where n is the set of all annotated MISSING buscos, and k is the number of buscos within n that are annotated to that SPECIFIC GO term. The larger the ratio, the more MISSING genes are related to that particular GO category. BgRatio = M/N, where N is the set of all annotated BUSCOS in the background (universal set), and M is the number of buscos within N that are annotated to that SPECIFIC GO term. The larger the ratio, the more OVERALL buscos are annotated for that particular GO category.

By consequence, M should always be larger than k, because it includes the missing buscos, plus any other buscos not missing that are also associated with that specific GO term. Biologically for the nematomorphs, the most relevant enriched genes are the ones where M is the closest to k, which implies that most of the genes in the background set annotated to that category are missing in nematomorphs. (genes with the smallest pvalue)

Count = k, number of annotated missing buscos.

Save supplementary table with significant terms:

bind_rows(enriched) %>% write_csv(., file = "../results/EnrichedGOterms1.csv")

Expand column with which BUSCOs are annotated to each GO term, so that we can count how many unique BUSCOs were significant:

significant_buscos <- bind_rows(enriched) %>%
  mutate(significant_BUSCOs = str_split(`BUSCO ID`, "/")) %>%
  unnest(significant_BUSCOs) %>%
  group_by(`GO category`, significant_BUSCOs) %>%
  summarise(CountGOterms = n()) %>%
  arrange(CountGOterms, `GO category`)
`summarise()` has grouped output by 'GO category'. You can override using the `.groups` argument.
significant_buscos

All unique significant BUSCOs between BP, CC and MF:

unique_significant_buscos
  [1] "111746at33208" "327301at33208" "405257at33208" "585078at33208" "14776at33208"  "152327at33208"
  [7] "247183at33208" "259581at33208" "272605at33208" "281693at33208" "285966at33208" "302279at33208"
 [13] "304388at33208" "359532at33208" "361216at33208" "382000at33208" "386666at33208" "397659at33208"
 [19] "398202at33208" "441575at33208" "447400at33208" "464204at33208" "500808at33208" "509816at33208"
 [25] "515058at33208" "534757at33208" "544919at33208" "558368at33208" "561248at33208" "563359at33208"
 [31] "567650at33208" "568170at33208" "578734at33208" "595010at33208" "608190at33208" "616149at33208"
 [37] "619415at33208" "642773at33208" "464243at33208" "518635at33208" "638541at33208" "114954at33208"
 [43] "157138at33208" "421732at33208" "450643at33208" "631173at33208" "107574at33208" "108764at33208"
 [49] "131848at33208" "200997at33208" "206858at33208" "344611at33208" "365323at33208" "379076at33208"
 [55] "381303at33208" "443305at33208" "454911at33208" "470197at33208" "505718at33208" "508986at33208"
 [61] "56573at33208"  "571944at33208" "589773at33208" "636041at33208" "315044at33208" "334820at33208"
 [67] "340577at33208" "378913at33208" "518288at33208" "260752at33208" "477998at33208" "488635at33208"
 [73] "503828at33208" "580896at33208" "191938at33208" "137383at33208" "564234at33208" "143997at33208"
 [79] "277935at33208" "424646at33208" "432287at33208" "446031at33208" "515528at33208" "544849at33208"
 [85] "545063at33208" "628231at33208" "368565at33208" "514121at33208" "274064at33208" "361100at33208"
 [91] "524148at33208" "560566at33208" "569294at33208" "581543at33208" "239597at33208" "474302at33208"
 [97] "111730at33208" "119591at33208" "218441at33208" "257339at33208" "284059at33208" "295556at33208"
[103] "429011at33208" "433216at33208" "247905at33208" "353768at33208" "436539at33208" "488559at33208"
[109] "495329at33208" "464286at33208" "430701at33208" "92367at33208" 

Order buscos by numbr of GO terms associated with them:

ordered_buscos <- significant_buscos %>%
  group_by(significant_BUSCOs) %>%
  summarize(N_GO = sum(CountGOterms)) %>%
  arrange(N_GO) %>%
  mutate(significant_BUSCOs = fct_reorder(significant_BUSCOs, N_GO))

Plot number of GO terms for each missing BUSCO:

p_significant_buscos <- significant_buscos %>%
  ggplot() +
  geom_col(aes(x = factor(significant_BUSCOs, levels = ordered_buscos$significant_BUSCOs),
               y = CountGOterms, fill = `GO category`)) +
  scale_fill_brewer(type = "qual") +
  xlab("BUSCO ID") + ylab("Number of annotated GO terms") +
  #ggtitle("Number of GO terms for each Missing BUSCO") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
p_significant_buscos

ggsave("../figures/enrichment/buscos.pdf", p_significant_buscos,
       width = 6.75, height = 4, units = "in", useDingbats = F)

Plot enrichment

https://guangchuangyu.github.io/2016/01/go-analysis-using-clusterprofiler/

Enrichment map: Visualizes gene sets as a network. Mutually overlapping gene sets tend to cluster together, making it easier for interpretation. When the similarity between terms meets a certain threshold (default is 0.2, adjusted by parameter ‘min_edge’), there will be edges between terms. The stronger the similarity, the shorter and thicker the edges.

set.seed(5) #2 without clusters also good
p_map <- enrich_GO %>%
  purrr::map(~emapplot(pairwise_termsim(.x),
                       showCategory=18, shadowtext = F,
                       layout.params = list('gem'),
                       cex.params = list(line=.3, category_label = .6, category_node = .7),
                       #cluster.params = list(cluster = T ,legend = T),
                       repel = T))
p_map[[1]]

p_map[[2]]

Save:

p_map_paths <- stringr::str_c("../figures/enrichment/enrichment_map_", names(p_map), ".pdf")
pwalk(.l = list(p_map_paths, p_map),
      ~ggsave(filename = .x, plot = .y,
              width = 10, height = 10, units = "in", useDingbats = F))

Dotplot:

p_dot <- enrich_GO %>%
  purrr::map(~dotplot(.x))
p_dot
$BP

$CC

$MF

Save:

p_dot_paths <- stringr::str_c("../figures/enrichment/enrichment_dot_", names(p_map), ".pdf")
pwalk(.l = list(p_dot_paths, p_dot),
      ~ggsave(filename = .x, plot = .y,
              width = 7, height = 7, units = "in", useDingbats = F))

Similar idea, but based on calculated column of Missing over background BUSCOs:

p_MissingRatio <- bind_rows(enriched) %>%
  ggplot() +
  geom_col(aes(x = reorder(`GO term description`, `Missing over background`),
               y = `Missing over background`,
               fill = `GO category`)) +
  scale_fill_brewer(type = "qual") +
  scale_y_continuous(breaks = c(0,.25,.5,.75,.9), limits = c(0,1)) +
  ylab("Proportion of missing over background BUSCOs") +
  xlab("GO term") +
  #ggtitle("Missing BUSCOs relative to Metazoa set") +
  coord_flip() +
  theme_minimal()
p_MissingRatio

Save:

ggsave("../figures/enrichment/missing_ratio1.pdf", p_MissingRatio,
       width = 6.85, height = 3, units = "in", useDingbats = F)

Gene-concept network, plots linkage of genes and enriched GO terms:

p_network <- enrich_GO %>%
  purrr::map(~cnetplot(.x, #showCategory = 6, # For 6 top BP terms
                       repel = T, shadowtext = 'all')) #, colorEdge = T))
p_network#[[1]]
$BP

$CC

$MF

Only the top 5 significant terms are displayed for simplicity (default), which really only affects BP that has 37 terms total.

Save:

p_net_paths <- stringr::str_c("../figures/enrichment/enrichment_network2_", names(p_map), ".pdf")
pwalk(.l = list(p_net_paths, p_network),
      ~ggsave(filename = .x, plot = .y,
              width = 10, height = 10, units = "in", useDingbats = F))

Directed acyclic graph (DAG) of enriched GO terms:

p_dag <- enrich_GO %>%
  purrr::map(~plotGOgraph(.x, firstSigNodes = 10)) #10 is default

# pdf("../figures/enrichment/enrichment_dag_BP2.pdf") # Saves only first cluster, not the 3
# plot(p_dag[[1]]$complete.dag)
# dev.off()
# 
# pdf("../figures/enrichment/enrichment_dag_CC.pdf")
# plot(p_dag[[2]]$complete.dag)
# dev.off()
# 
# pdf("../figures/enrichment/enrichment_dag_MF.pdf")
# plot(p_dag[[3]]$complete.dag)
# dev.off()
plot(p_dag[[1]]$complete.dag)

plot(p_dag[[2]]$complete.dag)

plot(p_dag[[3]]$complete.dag)

Only the top 10 significant terms are displayed for simplicity (default), which really only affects BP that has 37 terms total. Each node represents a GO term and branches represent the containment relationships and the degree of colors represent the extent of enrichment, with black and white ellipses representing non-significant enrichment and yellow to red representing the gradient from higher to lower p.adjust values (red most significant). The numbers under the GO terms are the ratio of missing buscos annotated with that term over the number of buscos annotated with that term in the reference database (Missing over background column of enriched_GO df). Top 10 of significantly enriched terms are represented in boxes and the rest in ellipses. BP: Among the 37 significantly enriched GO BP terms, the higher numbers were in cilium and cell projection.

Scratch

To quickly look at a GO term: search for id in https://www.ebi.ac.uk/QuickGO/

---
title: "BUSCO and enrichment analyses"
author: Tauana J. Cunha
date: "February 2023"
output:
  html_notebook:
    code_folding: none
    highlight: tango #default, tango, pygments, kate, monochrome, zenburn, textmate
    toc: true
    toc_float:
      collapsed: false
---

```{r}
.libPaths(new = "libR/")
library(VennDiagram)
library(venneuler)
library(RColorBrewer)
library(clusterProfiler)
library(GO.db)
library(enrichplot)
library(tidyverse)
```

## Busco data
Load busco results:
```{r}
acutogordius_hic <- read_tsv("../results/busco545-Nematomorpha/out_metazoa_odb10/Acutogordius_australiensis-152393-HiC.fa/run_metazoa_odb10/full_table.tsv", skip = 2)
nectonema_draft <- read_tsv("../results/busco545-Nematomorpha/out_metazoa_odb10/Nectonema_munidae-DNA05827-Draft.fa/run_metazoa_odb10/full_table.tsv", skip = 2)
```

Join all in one data frame:
```{r}
buscos <- bind_rows("Acutogordius_HiC" = acutogordius_hic,
                    "Nectonema_Draft" = nectonema_draft,
                    .id = "assembly") %>% rename("BuscoID" = "# Busco id")
buscos
```

Set of all 954 Metazoa buscos:
```{r}
metazoa_genes <- buscos %>% pull(BuscoID) %>% unique()
#write_csv2(as.data.frame(metazoa_genes),
#           file = "../data/metazoa_buscos.csv", col_names = F)
```

## Exploratory plots
Length histogram:
```{r}
ggplot(buscos) +
  geom_histogram(aes(x = Length, fill = Status)) +
  scale_fill_brewer(type = "qual") +
  theme_minimal() +
  facet_wrap(~assembly, ncol=1)
```

Length boxplot:
```{r}
ggplot(buscos) +
  geom_boxplot(aes(x = Status, y = Length, fill = assembly)) +
  scale_fill_brewer(type = "qual") +
  theme_minimal()
```

## Venn diagram
List of genes present for each sample, and the total metazoan set:
```{r}
present <- buscos %>%
  filter(Status != "Missing") %>%
  with(., split(BuscoID, assembly))

present$Metazoa_buscos <- metazoa_genes

summary(present)
```

Only option that automatically puts smaller sets inside larger set, but circles not proportional:
```{r}
myCol <- brewer.pal(3, "Pastel2")
vennbuscos <- venn.diagram(x = present, disable.logging = T,
                           filename = '../figures/venn_buscos/venn_buscos.png',
             category.names = c("Acutogordius" , "Nectonema" , "Metazoa BUSCOS"),
             imagetype="png", resolution = 300,
             height = 480, width = 480,
             # Circles
             lwd = 2, lty = 'blank', fill = myCol,
             # Numbers
             cex = .5, fontface = "bold", fontfamily = "sans",
             # Set names
             cat.cex = 0.4, cat.fontface = "bold", cat.default.pos = "text",
             cat.fontfamily = "sans")
```

Option with correct proportions, but not nested:
This is the one used for the main Figure in the manuscript.
```{r}
pdf(file = "../figures/venn_buscos/venn_prop.pdf", width = 5, height = 5)
  plot(venneuler(c(Acutogordius=78, Nectonema=88, MetazoaBuscos=954,
                 "Acutogordius&Nectonema"=568, "Acutogordius&MetazoaBuscos"=78,
                 "Nectonema&MetazoaBuscos"=88 ,"Acutogordius&Nectonema&MetazoaBuscos"=734)))
dev.off()
```

Chi-square test for overlap of 220 missing genes. First create a matrix with number of genes for each/both species, taken easily from the venn_buscos.png file created just above.
```{r}
# cols = present in Acuto, absent in Acuto
# rows = present in Necto, absent in Necto
chimatrix <- matrix(c(568, 88, 78, 220), ncol = 2)
chimatrix

chisq.test(chimatrix)
```


## Enrichment analysis

Hypergeometric enrichment (= simple enrichment or over-representation analysis) to assess whether the number of missing buscos associated with a given function is larger than expected by chance.

1- Define GO universe, the background set of all GO terms against which we will test for enrichment. It should include any gene that could have been positive (>GO terms of all 954 metazoan buscos and their GO ancestral terms)
2- Define genes significant for variable of interest (>missing buscos in genome of nematomorphs)
3- Calculate proportion of significant genes in gene set
4- Estimate probability and significance (p-value)

### GO universe
Start with loading the data we got with a custom python script (buscos_to_GO.py) that retrieves GO ids/terms from the OrthoDB v10 based on the list of all 954 Metazoan Busco IDs. Then filter only for GO terms (there are interpro and other lines there too).
The terms themselves were not complete (NA for unclassified GOs), and even those do have classification if you look them up in https://www.ebi.ac.uk/QuickGO/, so here we will only keep the original BuscoID and the GO ids retrieved with the python script. We will use the package GO.db in the following step to get all GO terms and their ontology classification from each GO id.
```{r}
buscos_GOs <- read_csv("../data/metazoa_buscos_GOterms.csv") %>%
  filter(grepl("GO:", name)) %>%
  filter(!obsolete_term) %>%
  select(BuscoID = orthodb_id, GOID = id, ancestor_terms,
         count, obsolete_term)
buscos_GOs
```

The data above only has the specific GO ids/terms for the Busco genes, but not the associated higher level terms. Let's expand the lists of ancestor_terms, and bind it with the original data set, so that each term (specific and ancestral) has its own line:
```{r}
metazoanGO <- buscos_GOs %>%
  mutate(ancestor_terms = str_split(ancestor_terms, ",")) %>%
  select(BuscoID, GOID=ancestor_terms) %>%
  unnest(GOID) %>%
  bind_rows(select(buscos_GOs, BuscoID, GOID)) %>%
  rowwise() %>%
  mutate(GO_term = Term(GOID),
         GO_category = Ontology(GOID)) %>%
  ungroup %>%
  filter(!is.na(GO_category)) # GO ids that did not result in any terms
                      # quickgo shows they are secondary ids of other GOs already represented
metazoanGO
```

Prepare GO universe input files for enrichment analysis:
From help of enricher(): TERM2GENE: user input annotation of TERM TO GENE mapping, a data.frame of 2 column with term and gene.
```{r}
term2gene <- metazoanGO %>%
  split(f = as.factor(.$GO_category)) %>%
  purrr::map(~select(.x, GOID, BuscoID))

term2gene
```

This one is optional, but it's how we keep the GO term name:
From help of enricher(): TERM2NAME: user input of TERM TO NAME mapping, a data.frame of 2 column with term and name.
```{r}
term2name <- metazoanGO %>%
  split(f = as.factor(.$GO_category)) %>%
  purrr::map(~select(.x, GOID, GO_term))

term2name
```

### Genes of interest
Input file for set of genes of interest to test for enrichment (should be a set of unique ids): buscos that are missing from both Acutogordius and Nectonema. The code creates a T/F column for whether or not the busco is missing, then groups by BuscoID and sums TRUEs, so that we get a column where 0 = not missing in either, 1 = missing in one of the species, 2 = missing in both species:
```{r}
overlap_missing <- buscos %>%
  mutate(is_missing = Status == "Missing") %>%
  group_by(BuscoID) %>%
  summarize(Nmissing = sum(is_missing)) %>%
  filter(Nmissing == 2) %>%
  pull(BuscoID)
#overlap_missing
```

### Over-representation analysis
Now put all together in clusterprofiler analysis. For each of the three GO categories (Cellular Component, Molecular Function, Biological Process), run enricher function of clusterprofiler:
```{r}
enrich_GO <- purrr::map2(.x = term2gene, .y = term2name,
                         ~enricher(gene = overlap_missing, TERM2GENE = .x, TERM2NAME = .y))
# assign ontology to variable (says unknown by default)
enrich_GO[[1]]@ontology <- names(enrich_GO)[1]
enrich_GO[[2]]@ontology <- names(enrich_GO)[2]
enrich_GO[[3]]@ontology <- names(enrich_GO)[3]
enrich_GO#[[1]]@result %>% filter(p.adjust < 0.05)
```

Get result df from enrich_GO and add GO definitions and ontology:
```{r}
sign_GOs <- function(df){
  df %>%
    filter(p.adjust < 0.05) %>%
    mutate(GO_definition = Definition(ID),
           GO_category = Ontology(ID)) %>%
    separate(GeneRatio, into = c("size_missing_term","size_missing_all"),
             sep = '/', remove = F) %>%
    separate(BgRatio, into = c("size_background_term","size_background_all"),
             sep = '/', remove = F) %>%
    mutate_at(vars("size_missing_term","size_missing_all",
                   "size_background_term","size_background_all"), as.numeric) %>%
    mutate(`Missing over background` = size_missing_term/size_background_term) %>%
    select(`GO ID` = ID, `GO term description` = Description, 
           `GO category` = GO_category, `Associated missing BUSCOs` = Count,
           `Missing ratio` = GeneRatio, `Background ratio` = BgRatio,
           `Missing over background`, pvalue, p.adjust, 
           `BUSCO ID` = geneID, `GO term definition` = GO_definition) %>%
    remove_rownames()}

enriched <- enrich_GO %>%
  purrr::map(~sign_GOs(.x@result))
enriched
```

GeneRatio = k/n, where n is the set of *all annotated MISSING buscos*, and k is the number of buscos within n that are *annotated to that SPECIFIC GO term*. The larger the ratio, the more MISSING genes are related to that particular GO category.
BgRatio = M/N, where N is the set of *all annotated BUSCOS in the background* (universal set), and M is the number of buscos within N that are *annotated to that SPECIFIC GO term*. The larger the ratio, the more OVERALL buscos are annotated for that particular GO category.

By consequence, M should always be larger than k, because it includes the missing buscos, plus any other buscos not missing that are also associated with that specific GO term. *Biologically for the nematomorphs, the most relevant enriched genes are the ones where M is the closest to k, which implies that most of the genes in the background set annotated to that category are missing in nematomorphs.* (genes with the smallest pvalue)

Count = k, number of annotated missing buscos.

Save supplementary table with significant terms:
```{r}
bind_rows(enriched) %>% write_csv(., file = "../results/EnrichedGOterms.csv")
```

Expand column with which BUSCOs are annotated to each GO term, so that we can count how many unique BUSCOs were significant:
```{r}
significant_buscos <- bind_rows(enriched) %>%
  mutate(significant_BUSCOs = str_split(`BUSCO ID`, "/")) %>%
  unnest(significant_BUSCOs) %>%
  group_by(`GO category`, significant_BUSCOs) %>%
  summarise(CountGOterms = n()) %>%
  arrange(CountGOterms, `GO category`)
significant_buscos
```

All unique significant BUSCOs between BP, CC and MF:
```{r}
unique_significant_buscos <- significant_buscos %>% pull(significant_BUSCOs) %>% unique
unique_significant_buscos
#unique_significant_buscos %in% overlap_missing
```

Order buscos by numbr of GO terms associated with them:
```{r}
ordered_buscos <- significant_buscos %>%
  group_by(significant_BUSCOs) %>%
  summarize(N_GO = sum(CountGOterms)) %>%
  arrange(N_GO) %>%
  mutate(significant_BUSCOs = fct_reorder(significant_BUSCOs, N_GO))
```

Plot number of GO terms for each missing BUSCO:
```{r}
p_significant_buscos <- significant_buscos %>%
  ggplot() +
  geom_col(aes(x = factor(significant_BUSCOs, levels = ordered_buscos$significant_BUSCOs),
               y = CountGOterms, fill = `GO category`)) +
  scale_fill_brewer(type = "qual") +
  xlab("BUSCO ID") + ylab("Number of annotated GO terms") +
  #ggtitle("Number of GO terms for each Missing BUSCO") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
p_significant_buscos
```

```{r}
ggsave("../figures/enrichment/buscos.pdf", p_significant_buscos,
       width = 6.75, height = 4, units = "in", useDingbats = F)
```


### Plot enrichment
https://guangchuangyu.github.io/2016/01/go-analysis-using-clusterprofiler/

Enrichment map:
Visualizes gene sets as a network. Mutually overlapping gene sets tend to cluster together, making it easier for interpretation. When the similarity between terms meets a certain threshold (default is 0.2, adjusted by parameter 'min_edge'), there will be edges between terms. The stronger the similarity, the shorter and thicker the edges.
```{r}
set.seed(5) #2 without clusters also good
p_map <- enrich_GO %>%
  purrr::map(~emapplot(pairwise_termsim(.x),
                       showCategory=18, shadowtext = F,
                       layout.params = list('gem'),
                       cex.params = list(line=.3, category_label = .6, category_node = .7),
                       #cluster.params = list(cluster = T ,legend = T),
                       repel = T))
p_map[[1]]
p_map[[2]]
```

Save:
```{r}
p_map_paths <- stringr::str_c("../figures/enrichment/enrichment_map_", names(p_map), ".pdf")
pwalk(.l = list(p_map_paths, p_map),
      ~ggsave(filename = .x, plot = .y,
              width = 10, height = 10, units = "in", useDingbats = F))
```


Dotplot:
```{r}
p_dot <- enrich_GO %>%
  purrr::map(~dotplot(.x))
p_dot
```

Save:
```{r}
p_dot_paths <- stringr::str_c("../figures/enrichment/enrichment_dot_", names(p_map), ".pdf")
pwalk(.l = list(p_dot_paths, p_dot),
      ~ggsave(filename = .x, plot = .y,
              width = 7, height = 7, units = "in", useDingbats = F))
```


Similar idea, but based on calculated column of Missing over background BUSCOs:
```{r}
p_MissingRatio <- bind_rows(enriched) %>%
  ggplot() +
  geom_col(aes(x = reorder(`GO term description`, `Missing over background`),
               y = `Missing over background`,
               fill = `GO category`)) +
  scale_fill_brewer(type = "qual") +
  scale_y_continuous(breaks = c(0,.25,.5,.75,.9), limits = c(0,1)) +
  ylab("Proportion of missing over background BUSCOs") +
  xlab("GO term") +
  #ggtitle("Missing BUSCOs relative to Metazoa set") +
  coord_flip() +
  theme_minimal()
p_MissingRatio
```

Save:
```{r}
ggsave("../figures/enrichment/missing_ratio.pdf", p_MissingRatio,
       width = 6.75, height = 4, units = "in", useDingbats = F)
```


Gene-concept network, plots linkage of genes and enriched GO terms:
```{r}
p_network <- enrich_GO %>%
  purrr::map(~cnetplot(.x, #showCategory = 6, # For 6 top BP terms
                       repel = T, shadowtext = 'all')) #, colorEdge = T))
p_network#[[1]]
```

Only the top 5 significant terms are displayed for simplicity (default), which really only affects BP that has 37 terms total.

Save:
```{r}
p_net_paths <- stringr::str_c("../figures/enrichment/enrichment_network_", names(p_map), ".pdf")
pwalk(.l = list(p_net_paths, p_network),
      ~ggsave(filename = .x, plot = .y,
              width = 10, height = 10, units = "in", useDingbats = F))
```


Directed acyclic graph (DAG) of enriched GO terms:
```{r}
p_dag <- enrich_GO %>%
  purrr::map(~plotGOgraph(.x, firstSigNodes = 10)) #10 is default

# pdf("../figures/enrichment/enrichment_dag_BP2.pdf") # Saves only first cluster, not the 3
# plot(p_dag[[1]]$complete.dag)
# dev.off()
# 
# pdf("../figures/enrichment/enrichment_dag_CC.pdf")
# plot(p_dag[[2]]$complete.dag)
# dev.off()
# 
# pdf("../figures/enrichment/enrichment_dag_MF.pdf")
# plot(p_dag[[3]]$complete.dag)
# dev.off()
```


```{r}
plot(p_dag[[1]]$complete.dag)
```


```{r}
plot(p_dag[[2]]$complete.dag)
```


```{r}
plot(p_dag[[3]]$complete.dag)
```

Only the top 10 significant terms are displayed for simplicity (default), which really only affects BP that has 37 terms total.
Each node represents a GO term and branches represent the containment relationships and the degree of colors represent the extent of enrichment, with black and white ellipses representing non-significant enrichment and yellow to red representing the gradient from higher to lower p.adjust values (red most significant). The numbers under the GO terms are the ratio of missing buscos annotated with that term over the number of buscos annotated with that term in the reference database (Missing over background column of enriched_GO df). Top 10 of significantly enriched terms are represented in boxes and the rest in ellipses. BP: Among the 37 significantly enriched GO BP terms, the higher numbers were in cilium and cell projection.


## Scratch
To quickly look at a GO term: search for id in https://www.ebi.ac.uk/QuickGO/